This project is an analysis of pledges for financing submitted to Kickstarter, a crowdfunding platform where entrepreneurs submit their ideas for funding. The dataset was obtained from Kaggle. For this analysis we will use libraries from the Tidyverse, a set of packages developed primarily by Hadley Wickham.
We start by loading the libraries.
Now that we have the working directory we can just use the relative path of the GitHub repo where the data is also maintained. The next step in the analysis is loading the data. The readr library is quite helpful here: the original file is in zip format, but the library takes care of the heavy lifting for us.
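As a minimal sketch of that loading step, the snippet below shows readr decompressing a file on the fly. The repository path is not shown in this write-up, so the example writes a throwaway gzip file to a temporary directory instead; the file name, the toy rows, and the name `demo_data` are assumptions made for illustration. readr treats files ending in `.zip` the same way, based on the extension.

```r
library(readr)

# Write a tiny compressed CSV to a temporary location (illustrative data only).
demo_path <- file.path(tempdir(), "demo_projects.csv.gz")
con <- gzfile(demo_path, "w")
writeLines(c("name,goal,backers", "Project A,5000,12", "Project B,20000,0"), con)
close(con)

# read_csv() detects the compression from the file extension and decompresses it.
demo_data <- read_csv(
  demo_path,
  col_types = cols(
    name = col_character(),
    goal = col_double(),
    backers = col_double()
  )
)
print(demo_data)
```

Declaring `col_types` up front keeps the column types explicit and avoids the type-guessing message.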
The data set was created by user Kermical by web scraping the platform. It is composed of 15 variables and has 378K+ observations. The variables indicate the name of the project, the category of the project (e.g. food, restaurants, film and video, etc.), when it was launched, currency, number of backers, status of the project (successful vs failed), etc.
The first variable I am interested in exploring is goal (funding requested). This will provide a good idea of the range of projects that are pitched on the platform. We start this exploration with a histogram. To define the size of the bins we use one half of the standard deviation of our variable of interest.
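The bin-width rule described above can be sketched as follows. This is an illustration on simulated data, not the original plotting code: the toy data frame, the axis labels, and the names `bin_width` and `goal_hist` are assumptions.

```r
library(ggplot2)

# Simulated, right-skewed goals standing in for the real column.
set.seed(1)
toy <- data.frame(goal = round(rlnorm(500, meanlog = 8.5, sdlog = 1.2)))

# Bin width: one half of the standard deviation of the variable.
bin_width <- sd(toy$goal) / 2

goal_hist <- ggplot(toy, aes(x = goal)) +
  geom_histogram(binwidth = bin_width) +
  labs(x = "Goal (funding requested)", y = "Number of projects")
print(goal_hist)
```

Because the standard deviation is itself inflated by extreme values, this rule produces very wide bins on skewed data.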
summary(kickstart_data$goal)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2000 5200 49081 16000 100000000
As we can see from the plot there are a few observations with extremely large values. This can be explained by the fact that the goal variable is not standardized in a common currency; additionally, there could simply be high variation in the amount of money a given project requires. Let’s explore how the distribution looks if we use the standardized goal requested in USD.
summary(kickstart_data$usd_goal_real)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 2000 5500 45454 15500 166361391
We still have a large number of outliers. In this case we will filter everything that is above 1.5 times the interquartile range (IQR), defined as the difference between the 75th and 25th percentiles. If we follow this approach we end up with 303,549 observations. This represents a loss of 19.7 percent of all the observations. To mitigate this problem I adopted a more lax cutoff and set the limit to 3.5 times the difference between the third and first quartile. With this criterion we have 341,563 observations, or 90 percent of the original dataset. The resulting histogram of funds requested without outliers is below.
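A minimal sketch of this filter, following the text literally: a row is kept when its value is at or below k times the IQR. This differs from the more common Tukey fence, which places the cutoff at the third quartile plus k times the IQR. The helper name `drop_above_iqr` and the toy values are assumptions for illustration.

```r
library(dplyr)

# Keep rows whose value is at or below k times the interquartile range.
# `column` is the name of a numeric column, passed as a string.
drop_above_iqr <- function(data, column, k = 3.5) {
  cutoff <- k * IQR(data[[column]], na.rm = TRUE)
  filter(data, .data[[column]] <= cutoff)
}

# Toy data standing in for usd_goal_real.
toy <- tibble(usd_goal_real = c(500, 1000, 2000, 5000, 8000, 15000, 30000, 1e6))

strict <- drop_above_iqr(toy, "usd_goal_real", k = 1.5)
lax <- drop_above_iqr(toy, "usd_goal_real", k = 3.5)

# Share of observations retained under each cutoff.
c(strict = nrow(strict) / nrow(toy), lax = nrow(lax) / nrow(toy))
```

Comparing the retained share for both values of k is how the trade-off between 1.5 and 3.5 described above can be evaluated.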
We now look at pledged funds (funds actually collected) in local currency. We also provide some summary statistics using the summary command. We will try to assess whether the funds pledged also show a high concentration on round numbers (defined as those that fall on 5k, 10k, 20k, etc.).
summary(kickstart_data$pledged)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 30 620 9683 4076 20338986
In the previous plot we see that there are some outliers that make the analysis hard without any data transformation. We will see if this situation is also present once we look at funds raised in USD. As in the previous case we also provide summary statistics with the summary command.
summary(kickstart_data$usd_pledged_real)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 31 624 9059 4050 20338986
We notice the same situation as with the goal variable: a high presence of outliers (the fact that we used USD did not make a significant difference). We follow the same approach and filter values higher than 3.5 times the IQR. This leaves us with 341,526 observations. We now plot the histogram of projects by funding raised in USD with no outliers.
We can see that there are a lot of projects with no funds collected and no concentration of projects on ‘round’ amounts (5k, 10k, etc.). Let’s explore the distribution of projects by their status (live, canceled, etc.). In the next section of the analysis we will explore how status and funds collected correlate.
We can see that a lot of projects failed (more than half). Let’s now see how the project concentration changes by category (music, film, design, etc.). This is an important step before we analyze correlations between different categorical variables.
The most common category in the dataset is film and video, followed closely by music, publishing and games.
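The status and category shares discussed above can be tabulated with a sketch like the one below. The column names `state` and `main_category`, the helper `share_by`, and the toy rows are assumptions; the real data would replace `toy`.

```r
library(dplyr)

# Count the rows per level of a categorical column and add each level's share.
# `column` is the column name passed as a string.
share_by <- function(data, column) {
  data %>%
    count(.data[[column]], sort = TRUE) %>%
    mutate(share = n / sum(n))
}

# Toy data for illustration only.
toy <- tibble(
  state = c("failed", "failed", "successful", "canceled", "failed", "live"),
  main_category = c("Music", "Film & Video", "Film & Video",
                    "Games", "Film & Video", "Publishing")
)

state_shares <- share_by(toy, "state")
category_shares <- share_by(toy, "main_category")
print(state_shares)
print(category_shares)
```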
I am also interested in exploring how long projects are online. We did not get this information in the original dataset, but we can derive it from the date when a project went online (launched) and the date of its deadline.
We can see there are some outliers due to the parsing of the date. We filter out those observations whose launch date has a year equal to 1970. We repeat the same plot, this time without the projects with wrong date formatting.
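The derivation of days_online and the 1970 filter can be sketched as below. The column types (a date-time for launched, a date for deadline), the decision to measure the gap in whole calendar days, and the toy rows are assumptions for illustration.

```r
library(dplyr)

# Toy data: one row carries the 1970 launch date produced by bad parsing.
toy <- tibble(
  launched = as.POSIXct(
    c("2017-09-02 04:43:57", "2016-01-01 10:00:00", "1970-01-01 01:00:00"),
    tz = "UTC"
  ),
  deadline = as.Date(c("2017-11-01", "2016-01-31", "2010-09-15"))
)

with_days <- toy %>%
  # Drop observations whose launch year is 1970.
  filter(format(launched, "%Y") != "1970") %>%
  # Days between the launch date and the deadline.
  mutate(
    days_online = as.numeric(
      difftime(deadline, as.Date(launched), units = "days")
    )
  )
print(with_days)
```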
We can see that the most common number of days for a project to be up is 30, with 60 as the second most common option.
The last variable I am interested in exploring is the number of backers of projects and how they are distributed. In the bivariate section we will explore how the number of backers is correlated with the success of a project. I also provide summary statistics with the summary command.
summary(kickstart_data$backers)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 2.0 12.0 105.6 56.0 219382.0
As with the other variables we see that there are several outliers in the data. We use the 3.5 IQR rule of thumb and plot the number of backers again.
The number of backers shows a distribution similar to funds obtained, with the largest concentration of projects at or near 0 backers.
Finally I will explore where projects are located. For this analysis I will use the variable country. We use a simple bar chart.
The most common place of origin of projects is the USA followed by GB and Canada.
There are 378,661 observations in this dataset, which measures different aspects of projects that were launched on the Kickstarter platform. The most important variables included in the dataset are:
category: indicates the subcategory of the project. This is a character variable that needs to be transformed to a factor. This has more than 50 levels and can get very specific.
main_category: indicates the main category of a project. This is a character variable that needs to be transformed to a factor. This has 14 levels. The most common category in the dataset is Film & Video, followed closely by Music.
deadline and launched: These are the dates when the project was taken down and when the project was launched. I created a new variable using the difference between the two dates (days_online).
goal: Amount of money the project wants to raise in local currency. It is a heavily skewed variable with some projects having a very large value. There is another variable that standardizes all projects to USD (usd_goal_real); on that scale the mean request was $45,454, far above the median of $5,500, and the maximum amount requested was $166,361,391.
pledged: Amount of money the project raised in local currency. It is a heavily skewed variable with some projects having a very large value. There is another variable that standardizes all projects to USD (usd_pledged_real).
country: This variable captures where the projects are located. This is a character variable that should be treated as a factor. The vast majority of projects come from the US, with almost 300,000.
One of the most interesting parts of the dataset comes from the fact that the majority of projects (more than 50 percent) fail. Another interesting aspect of the dataset is that the most common time online is 30 days.
I am interested in exploring the influence of the days a project is online, category and subcategory on the amount of money raised and the success of a project.
I created a new variable to understand how long projects are live on the platform (days_online).
I had to omit outliers from the dataset to better visualize the distribution of the data. The alternative would have been to use a logarithmic transformation. This approach was discarded even though the log-transformed data did indeed follow a roughly normal distribution. While the transformation is informative and may be useful in a regression model, it is hard to appreciate the concentration of projects on a log scale.
I am also interested in exploring differences in money collected by status of the project (successful vs cancelled) and by category of project (film, food, etc.). So now I will explore the distribution of funding requested and funding obtained, first by status of project and then by category. We use the same criterion of filtering values above 3.5 times the IQR.
The first plot corresponds to funds requested by status. We use a series of ridgeplots to better visualize this relationship.
The projects show a very similar distribution across statuses. We will now explore whether funding collected by status shows any difference from funds requested. Again we use a ridgeplot.
The funds collected show little difference across project statuses, with the exception of projects that were successful.
Let’s explore differences in funding requested by project category. I will use boxplots to have an easier time visualizing the distribution of funding by project category. We also filter outliers using the 3.5 IQR rule mentioned previously.
We can see that there are some differences in the amount of money a project requests by category. Projects in the design, food and technology categories have the highest median values.
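The medians that the boxplots display can also be computed directly, which makes the ranking of categories explicit. The sketch below uses toy values; the column names `main_category` and `usd_goal_real` and the result name `goal_by_category` are assumptions.

```r
library(dplyr)

# Toy data for illustration only.
toy <- tibble(
  main_category = c("Design", "Design", "Music", "Music",
                    "Technology", "Technology"),
  usd_goal_real = c(10000, 20000, 3000, 5000, 25000, 15000)
)

# Median funding requested per category, highest first.
goal_by_category <- toy %>%
  group_by(main_category) %>%
  summarise(median_goal = median(usd_goal_real), .groups = "drop") %>%
  arrange(desc(median_goal))
print(goal_by_category)
```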
Now that we have seen the distribution by category and state of the project let’s see how days_online, the variable created before, and funding pledged correlate.
We see that there is a high concentration of projects at the 30-day and 60-day deadlines. There seems to be no real correlation.
Finally I want to explore the correlation between number of backers and money pledged by projects. As in previous plots we filter outliers.
Before plotting some of the variables I was interested in understanding how different projects on the platform differ in money raised and money they wanted to raise. I was also interested in exploring how the money a project requested may relate to its status (live, cancelled, failed, etc.). Finally, I wanted to understand whether there was a clear relationship between the number of backers of a given project and the total funding secured.
Some of the relationships followed an expected pattern, such as number of backers and secured funds. The relationship there seems pretty linear: the more backers, the more funds a project secured.
Another relationship that showed an interesting pattern is the interplay between the state of a project and actual funds secured. All the different categories have a high concentration around zero dollars raised (cancelled projects, unknown, failed, etc.). The big exception is projects that actually did receive funding, where we see more variation in the amount of money raised.
An interesting pattern I observed was the fact that projects in the design and technology categories had the highest median values for funds requested. Given the nature of film projects I expected this to be the highest category.
Another interesting fact that I observed is the high concentration of funding requested around ‘round’ numbers such as 20k, 30k, 50k, etc. This type of pattern was also present in the number of days a project was online and the funds requested.
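One way to put a number on this concentration is to compute the share of values that fall exactly on a multiple of some step. The step of 5,000, the helper name `is_round`, and the toy values below are assumptions for illustration.

```r
# TRUE for positive values that are an exact multiple of `step`.
is_round <- function(x, step = 5000) {
  x > 0 & x %% step == 0
}

# Toy goals for illustration only.
toy_goal <- c(5000, 10000, 12500, 20000, 750, 0, 30000, 4999)

# Share of goals that land on a round number.
round_share <- mean(is_round(toy_goal))
round_share
```

The same helper can be applied to pledged amounts to compare how often requested and collected funds land on round values.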
The relationship between the number of backers and the secured funding was very strong. If we exclude the projects with zero backers the relationship is quite clear.
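The strength of this relationship can be summarized with a correlation coefficient, computed with and without the zero-backer projects. This is a sketch on toy values, and the choice of the Pearson coefficient is an assumption, since the write-up does not name a statistic.

```r
library(dplyr)

# Toy data for illustration only.
toy <- tibble(
  backers = c(0, 0, 5, 10, 20, 40, 80),
  usd_pledged_real = c(0, 0, 250, 520, 1100, 1900, 4200)
)

# Exclude projects with zero backers.
with_backers <- toy %>% filter(backers > 0)

# Pearson correlation on all projects and on projects with backers.
cor_all <- cor(toy$backers, toy$usd_pledged_real)
cor_with_backers <- cor(with_backers$backers, with_backers$usd_pledged_real)
c(all = cor_all, with_backers = cor_with_backers)
```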
Finally I wanted to explore the relationship between amount secured for a project, category of the project, and the state of the project.